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Abstract 

Probabilistic models over strings have played a key role in developing methods allowing indels 
to be treated as phylogenetically informative events [111 1221 1181 1281 I3UI 1291 [3J. In this work, 
we show how to derive a closed-form expression for computing the probability of a collection of 
observed sequences given a tree. This expression is in terms of elementary operations on indexed 
collections of matrices, and can be derived for a wide range of indel models. 

keywords indel, alignment, probabilistic models, TKF91, string transducers, automata, graphical 
models, phylogenetics, factor graphs 

1 Introduction 

This work is motivated by models of evolution that go beyond substitutions and also incorporate 
insertions and deletions as phylogenetically informative events. In recent years, several groups have 
been working on turning these models into phylogenetic reconstruction methods [HI [221 HH1 EH |30l 
129] [3] . These reconstruction methods are appealing because they can produce calibrated confidence 
assessments |30j . they leverage more information than pure substitution models |17| . and they are 
generally less susceptible to biases induced by conditioning on a single alignment [37]. 

Model-based phylogenetic methods require scoring each tree in a large collection of hypothesized 
trees. This collection of proposed trees is generated by a Markov chain Monte Carlo (MCMC) algo- 
rithm in Bayesian methods, or by a search algorithm in frequentist methods. Scoring one hypothesized 
tree then boils down to computing the probability of the observed sequences given the tree. This com- 
putation requires summing over all possible ancestral sequences and derivations (sets of insertions, 
deletions and substitutions on the fixed tree that can yield the observed sequences). Since we put no 
bounds on the length of the strings, these summations range over a countably infinite space, creat- 
ing a challenging problem. Previous work has focussed on using tools from combinatorics, dynamic 
programming, and graph theory to attack this problem [HI \TZ\ \FZ[ |2"H IT51 [5\ U\ |2"T[ 155] . 

In this work, we explore a different perspective, purely based on matrix algebra. We use this 
formulation to obtain simplified marginalization algorithms. In the general cases, our representation 
improves the asymptotic worst-case running time of existing transducer algorithms [27 ]. We also 
characterize a subclass of problems, triangular problems, where the running time can be further 
improved. We show, for example, that marginalization problems derived from the popular TKF91 
[51] are triangular. We get as a corollary a new complexity analysis of inference in TKF91 models. 
The bounds for TKF91 on star trees and perfect trees coincide with those obtained from previous 
work [21] . but our proving method is easier to extend to other models. 

All the methods described here are still exponential in the number of taxa. However, they can be 
applied to approximate inference in large datasets using existing Gibbs sampling methods 28J. This 
is done by augmenting the state space of the MCMC algorithm, for example with internal sequences. 
The running time then depends on the number of taxa involved in the local tree perturbation. This 
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number can be as small as two, but there is a trade-off between the mixing rate of the MCMC chain 
and the computational cost of each move. 

There has been previous work on applying matrix algebra to automata [31, 20j, however most of 
this previous work focusses on probability distributions over a single string (unary potentials), which 
are insufficient to express non-trivial phylogenetic trees. Going from unary to binary interactions 
requires the detailed treatment of e transition introduced here. Other work has modelled interactions 
in the context of stochastic automata network [H f^], and use methods reminiscent to ours, but 
these interactions model synchronization steps between parallel systems rather than graphical model 
potentials. 

2 Formal description of the problem 

2.1 Notation 

We use t to denote a rooted phylogeny, with topology {Y,S')^ The root of r is denoted by f2; the 
parent of v £ f,v ^ f2, by pa(w) € T 7 ; the branch lengths by b(v) — b(pa(v) —> v); and the set of 
leaves by if C f . 

The models we are interested in are tree-shaped graphical models [THl H] H] where nodes are string- 
valued random variables X v over a finite alphabet E (for example a set of nucleotides or amino acids) . 
There is one such random variable for each node in the topology of the phylogeny, v £ "V, and the 
variable X v can be interpreted as an hypothesized biological sequence at one point in the phylogeny. 

Each edge (pa(i>) — > v) denotes evolution from one species to another, and is associated with 
a conditional probability V T (X V = s'lXp^^ = s) = 9 v (s,s'), where s,s' € S* are strings (finite 
sequences of characters from £). These conditional probabilities can be derived from the marginals 
of continuous time stochastic process [3U [TH |24l [23] , or from a parametric family [TBI [30] [23 121] • We 
also denote the distribution at the root by P t (Xq = s) = 9(s), which is generally set to the stationary 
distribution in reversible models or to some arbitrary initial distribution otherwise. Usually, only the 
sequences at the terminal nodes are observed, an event that we denote by & = (X v — x v : v e if). 

We denote the probability of the data given a tree by P r (^). The computational bottleneck of 
model-based methods is to repeatedly compute P T (^) for different hypothesized phylogenies r. This 
is the case not only in maximum likelihood estimators, but also in Bayesian methods, where a ratio 
of the form P T /(^)/P r (^) is the main contribution to the Metropolis-Hastings ratio driving Markov 
Chain Monte Carlo (MCMC) approximations. 

While it is more intuitive to describe a phylogenetic tree using a Bayesian network, it is useful to 
transform these Bayesian networks into equivalent factor graphs |2J when designing inference methods. 
A factor graph encodes a weight function over a space X assumed to be countable in this work. For 
example, this weight function could be a probability distribution, but it is convenient to drop the 
requirement that it sums to one. In phylogenetics, the space X is a tuple of \f\ strings, X = (£*)! 

A factor graph (see Figure [l] for an example) is based on a specific bipartite graph, built by letting 
the first bipartite component correspond to 'f (the nodes denoted by circles in the figure), and the 
second bipartite component, to a set of potentials or factors W described shortly (the nodes denoted 
by squares in the figure). 

We put an edge in the factor graph between w € W and v € "V if the value taken by w depends 
on the string at node v. Formally, a potential w € W is a map taking elements of the product space, 
(x v : v G 'f) G X , and returning a non- negative real number. However the value of the factors we 
consider only depend on a small subset of the variables at a time. More precisely, we will use two 
types of factors: unary factors, which depend on a single variable, and binary factors, which depend 
on two variables. We use the abbreviation SGF for such string-valued factor graph. 

reversible models such as TKF91, an arbitrary root can be selected without changing the likelihood. In this case, 
the root is used for computational convenience and has no special phylogenetic meaning. 
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Figure 1: Factor graph construction (right) derived from a phylogenetic tree (left): in green, the unary 
factor at the root capturing the initial string distribution; in blue, the binary factors capturing string 
mutation probabilities; and in red, the unary factor at the leaves, which are indicator functions on 
the observed strings. 



In order to write a precise expression for the probability of the data given a tree, F T (W), we con- 
struct a factor graph (shown in Figure[l]) with three types of factors: one unary potential Wfj(s) = 8(s) 
at the root modeling the initial string distribution; binary factors w„y(s,s') = 9 v >(s,s') capturing 
string mutation probabilities between pairs of species connected by an edge (v — > v') in the phyloge- 
netic tree; and finally, unary factors w v (s) — l[s = x v ] attached to each leaf v € Jzf, which are Dirac 
deltas on the observed strings. 

From this factor graph, the likelihood can be written as: 



(s„gE*:l)6f)£(E*)l''l \ve& 




For any functions /j : Xi — > R + over a countable spaces X ', we let ]X fi denote their pointwise product, 
and we let ||/|| denote the sum of the function over the space, 



11/11 = £/(*)■ 

The likelihood can then be expressed as follows 



n 



w 



(1) 



So far, we have seen each potential w as a black box, focussing instead on how they interact with each 
other. In the next section, we describe the form assumed by each individual potential. 



2.2 Weighted automata and transducers 

Weighted automaton (respectively, transducers), provide one way to define a function from the space 
of strings (respectively, the space of pairs of strings) to the non-negative reals. At a high level, 
weighted automata and transducers proceed by extending a simpler weight function defined over a 
finite, auxiliary space Q into a function over the space of strings via a collection of paths in Q. 

We explain how this is done, starting with automata (unary potentials). First, we designate one 
state in the state space Q to be the start state and one to be the end state (we need only consider one 
of each without loss of generality). We define a path as a sequence of states qt € Q and emissions € 
£ = EU{e}, starting at the start state and ending at the stop state: p = (qo, b\, qi, &2, ffei • • ■ ,0n,a„). 
Here the set of possible emissions E is the set of characters extended with the empty emission erl 

2 Note that we use the convention of using a when the character is an element of the extended set of characters S. 
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Let us now assign a non-negative number to each triplet formed by a transition (pair of state 
q, q' € Q) and emission a € E^] We call this map w : £ x Q 2 — > [0, oo), a parameter specified by the 
user |f] 

Next, we extend the input space of w in two steps. First, if w takes a path p as input, we define 
w(p) as the product of the weights of the consecutive triplets (qi,&i+i,qi+i) in p. Second, if w takes 
a string s £ S* as input, we define w(s) as the sum of the weights of the paths p emitting s, where a 
path p emits a string s if the list of characters in p, omitting e, is equal to s. 

For transducers (binary potentials), we modify the above definition as follows. First, emissions 
become pairs of symbols in E x S. Second, paths are similarly extended to emit pairs of symbols at 
each transition, p ~ (q , (ft, (02, a' 2 ), 92, ■ ■ ■ > (oVn &' n ), q n ). Finally a path emits a pair of strings 

(s, s') if the concatenation of the first characters of the emissions, a%, &2, ■ ■ ■ , & n , is equal to s after 
removing the e symbols, and similarly for s' with the second characters of each emission, &{, a' 2 , ■ ■ ■ , <j' n . 
For example, p = (qo, (a, e), q%, (b, c), 92, (e, e), q-n) emits the pair of strings (ab, c). 

The normalization of a transducer or automata is the sum of all the valid paths' weights, Z = \\w\\. 
We set the normalization to positive infinity when the sum diverges. 

We now have reviewed all the ingredients required to formalize the problem we are interested in: 
Problem description: Computing Equation ([I]), where each w G If is a weighted automaton or 
transducer organized into a tree-shaped SFG. 

3 Background 

In the previous section, we have shown how the normalization of a single factor is defined (how to 
compute it in practice is a another issue, that we address in Section [5]). In this section, we review how 
to transform the problem of computing the normalization of a full factor graph, Equation ([!]), into 
the problem of finding the normalization of a single unary factor. This is done using the elimination 
algorithm [2], a classical method used to break complex computations on graphical models into a 
sequence of simpler ones. 

The elimination algorithm is typically applied to find the normalization of factor graphs with finite- 
valued or Gaussian factors. Since our factors take values in a different space, the space of strings, it is 
useful in this section to take a different perspective on the elimination algorithm: instead of viewing 
the algorithm as exchanging messages on the factor graph, we view it as applying transformation on 
the topology of the graph. These operations alter the normalization of individual potentials, but, as 
we will show, they keep the normalization of the whole factor graph invariant, just as in standard 
applications of the elimination algorithm. 

The elimination algorithm starts with the initial factor graph, and eliminates one leaf node at each 
step (either a variable node or a factor node, there must be at least one of the two types). This process 
creates a sequence of factor graphs with one less nodes at each step, until there are only two nodes 
left (one variable and one unary factor). While the individual factors are altered by this process, the 
global normalization of each intermediate factor graph remains unchanged. 

We show in Figure [2] an example of this process, and a classification of the operations used to 
simplify the factor graph. We call the operations corresponding to a factor eliminations a pointwise 
product (Figure [2^b,c)), and the operations corresponding to variable eliminations, marginalization 
(Figure [2^ a)). We further define two kinds of pointwise products: those where the variable connected 
to the eliminated factor is also connected to a binary factor (the first kind, shown in Figure [2lb)), 




and the others (second kind, shown in Figure |2fc)). 

Let W denote a set of factors, and let W denote a new set of factors obtained by applying one 
of the three operations described above. For the elimination algorithm to be correct, the following 

3 In technical terms, we use the Mealy model. 

4 The inverse problem, learning the values for this map is also of interest I15| . but here we focus on the forward 
problem. 
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(b) Pointwise product 
(first kind) 




(b) 



(c) Pointwise product 
(second kind) 





(a) D ( c ) 



Figure 2: Left: The tree graph transformations used in the elimination algorithm. Right: An example 
of the application of these three operations on a graph induced by a phylogenetic tree. 



invariant should hold [2]: 







n 









4 Indexed matrices 

In this section, we describe the representation that forms the basis of our normalization method and 
establish the main properties of this representation. We use TKF91 [34] as a running example to 
illustrate the construction. 



4.1 Unary factors 

We assume without loss of generality that the states of the automata are labelled by the integers 
1, . . . , K, and that state 1 is the start state, and K, the stop state. 

We represent an automaton as a character-indexed collection, M, of K x K matrices, where K is 
the size of the state space of the automaton, and the index runs over the extended characters: 

M = (m & e u K (m + ) ) 

Each entry M&{k, k') encodes the weight of transitioning from states k to k' while emitting a G £. 

An indexed collection M specifies a corresponding weight function wm '■ — > [0, oo) as follows: 
let p denote a valid path, p = qo, ox, qi, <Ji, q2, ■ ■ ■ , & n , q n , and set 

w M (j?) = <t>(f{ M ^ ' 

where (j> selects the entry (1,K) corresponding to the product of weights starting from the initial and 
ending in the final state, (f>(M) — M(1,K). The product denotes a matrix multiplication over the 
matrices indexed by the emissions in p, in the order of occurrence. 
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Figure 3: An initial geometric length distribution unary factor. On the left, we show the standard 
graph-based automaton representation, where states are dashed circles (to differentiate them from 
nodes in SFGs), and arcs are labelled by emissions and weights. Zero weight arcs are omitted. The 
start and stop state are indicated by inward and outward arrows, respectively on the first and second 
state here. On the right, we show the indexed matrix representation of the same unary potential. 
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Figure 4: A unary factor encoding a string indicator function. In this example, only the string 'aba' 
is accepted (i.e. the string 'aba' is the only string emitted with positive weight). 



In the TKF91 model for example, an automaton is needed to model the geometric root distribution, 
9(s). We have in Figure [3] the classical state diagram as well as the indexed matrix representation, 
in the case where the length distribution has mean 1/(1 — q — p), and the proportion of 'a' symbols 
versus 'b' symbols is p/q. 

Another important example of an automaton is one that gives a weight of one to a single observed 
string, and zero otherwise: w v (s) = l[s = x v ]. We show the state diagram and the indexed matrix 
representation in Figure [4] Note that the indexed matrices are triangular in this case, a property that 
will have important computational consequence in Section [6] 



4.2 Binary factors 

We now turn to the weighted transducers, which we viewed as a collection of matrices as well, but 
this time with an index running over pairs of symbols (a, a') E S: 




Note that in the context of transducers, epsilons need to be kept in the basic representation, otherwise 
transducers could not give positive weights to pairs of strings where the input has a different length 
than the output. 

In phylogenetics, transducers represent the key component of the probabilistic model: the prob- 
ability of a string given its parent, w„y(s,s') = 9 v /(s, s'). For example, in TKF91 models, this 
probability is determined by three parameters: an insertion rate A > 0, a deletion rate fi > 0, and a 
substitution rate matrix Q. Given a branch of length t, the probability of all derivations between s 
and s' can be marginalized using the transducer shown in Figure [5l where the values of a, ft and P 
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Figure 5: A binary factor encoding a TKF91 marginal. 



are obtained by solving a system of differential equations yielding [34], for all a. a' e £: 

exp(— 

A(l - cxp((A - /M) 
fx — A exp((A — 
/48(f) 
A(l-a(t)) 
(exp(iQ))(a, </), 

and 7TO- is the stationary distribution of the process with rate matrix Q. While the standard description 
of TKF uses three states, our formalism allows us to express it with two states, with a state diagram 
shown in Figure [5} Note also that for simplicity we have left out the description of the 'immortal link' 
[34], but it can be handled without increasing the number of states. This is done by introducing an 
extra symbol # ^ £ flanking the end of the sequence. 



a(t) = 

m = 

7(0 = 



5 Operations on indexed matrices 

In this section, we give a closed-form expression for all the operations described in Section [2] We start 
with the simplest operations, epsilon-removal and normalization, and then cover marginalization and 
pointwise products. 

5.1 Unary operations 

While epsilon symbols are convenient when defining new automata, some of the algorithms in the next 
section assume that there are no epsilon transitions of positive weights. Formally, a factor M is called 
epsilon-free if M e — 0. Fortunately, converting an arbitrary unary factor into an epsilon-free factor, 
can be done using the well known generalization of the geometric series formula to matrices. But in 
order to apply this result to our situation, we need to assume that in all positive weight path, the 
last emission is never an epsilon. Fortunately, this can be done without loss of generality by adding a 
boundary symbol at the end of each string. We assume this thorough the rest of this paper and get: 

Proposition 1. Let M denote a unary factor such that each of the eigenvalues Xi of M e satisfies 
| A^l < 1. Then the transformed factor defined as: 

M (f)f0 ifo = e 

a \ M*M a otherwise 

M* = (1 - M)- 1 

is epsilon-free and assigns the same weights to strings, i.e. wm{s) = w M a)(s) for all s G £*. 

We show a proof of this elementary result to illustrate the conciseness of our indexed matrices 
notation: 
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Proof. Note first that the condition on the eigenvalues implies that (1 — M) is invertible. 

To prove equivalence of the weight functions wm and w M (t> , let s € S* be a string of length N. We 
need to show that the epsilon-removed automaton assigns the same weights w M (r> (s) to strings 
as the original automaton My. 

w M (s) = <j>lj2Il M A 

M* 1 M S1 M e fc 2 M S2 ■ ■ ■ M* N M Sl 

K (k u ...,k N )eN N 

\n=l Vm=0 / / 

= %(<) ( s ) 

Here we used the nonnegativity of the weights, which implies that if the automaton has a finite 
normalization, then the infinite sums above are absolutely convergent, and can therefore be rearranged. 

□ 

We also need the following elementary result on the normalizer of arbitrary automata. 
Proposition 2. Let M denote a unary potential, and define the matrix: 

with eigenvalues Xi . Then the normalizer Zm of the unary potential is given by: 

{<j)(M* ) if for alli,\\i\ < 1 
^ +oo otherwise 

where 4>{M) = M(1,K). 

Proof Note first that for all indexed matrices M, we have the following identity: 

JV 



N 



e n^= e^i =* N - 

(a u ...,a N )et L ™ =1 Ves 



Therefore by decomposing the set of all paths by their path lengths N, we obtain: 

o N 

i E E II 17 

y N=0 (<T 1 ,...,<T JV )eS i ™ =1 
\N=0 ) 

The infinite sum in the last line converges to M* if and only if the eigenvectors satisfy |Aj| < 1, and 
diverges to +oo otherwise since the weights are non-negative. 

□ 
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5.2 Binary operations 

We now describe how the operations of marginalization and pointwise product can be efficiently 
implemented using our framework. We begin with the marginalization operation. 

The marginalization operation that we first focus on, shown in Figure [2] takes as input a binary 
factor M, and returns a unary factor M^ m \ A precondition of this operation is that one of the nodes 
connected to M should have no other potentials attached to it. We use the variable a' for the values 
of that node. This node is eliminated from the factor graph after applying the operation. The other 
node is allowed to be attached to other factors, and its values are denoted by a. 

Proposition 3. If M is a binary factor, then the unary factor defined as: 

satisfies w M <m> (s) = J2 S 'eT,' w m(s, s') for all s € S*. 

Proof. We first note that for any matrices Af, t € S* with non-negative entries, we have 

oo oo 

£££^=££^ 

t££* N=0 t^ N t N=0 t e £N 

where we write s s if s has length L and s ~> s. 
We now fix s 6 E. Applying the result above, with 

N 



s~~> N s n—1 



for all s', and where N = |s'|, we get: 

£ w M (s,s') = <j) 




£ £ n 

, N=0 s-^ N s 71=1 £ e -£ 

W M (m) (S) 



□ 



Next, we move to the pointwise multiplication operations, where the epsilon transitions need a 
special treatment. In the following, we use the notation {A|S|C| . . . } to denote the tensor product of 
the matrices A, B, C, . . . jf] 

5 Note that we avoided the standard tensor product notation Cg> because of a notation conflict with the automaton 
and transducer literature, in which ® denotes multiplication in an abstract semi-ring (the generalization of normal 
multiplication, • used here). The operator (g) is also often overloaded to mean the product or concatenation of automata 
or transducers, which is not the same as the pointwise product as defined here. 
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*( M a 
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V' M f 




Figure 6: Counter-example used to illustrate that the epsilon-free condition in Proposition [3] is neces- 
sary. Here a is an arbitrary constant in the open interval (0, 1). 



Proposition 4. If M^- 1 ' and be two epsilon-free unary factors, then the unary factor M ( - p - ) 
defined as: 

M<M = {M«A4 2 )} (2) 

is epsilon-free and satisfies: 

w M (i) (s) ■ w M{ 2) (s) = w Mip) (s) (3) 

for all s G S* . 

Proof. Clearly, is equal to the zero matrix, so is epsilon-free. To show that it satisfies 

Equation [5J we first note that since M^> is epsilon free, 

for i e {1,2}. Hence: 

w MW (s) ■ w Mm ( 8 ) = ^ ( e n a 4 i} ] ( e n M < (1) 



' J Af (p) (■ 



n m ^ i; 



Ml 2 



forallseS*. □ 

Note that the epsilon-free condition is necessary. Consider for example the unary potential M over 
S = {a} shown in Figure [6j We claim that this factor M (which has positive epsilon transitions) does 
not satisfy the conclusion of Proposition [4] Indeed, for s = a, we have that (wm(s)) 2 = a 4 + 2a 3 + a 2 , 
but w M ( P ) (s) = a 4 + a 2 , where is defined as in Equation [5j 

We now turn to pointwise products of the second kind, where an automaton is pointwise multiplied 
with a transducer. The difference is that all epsilons cannot be removed from a transducer: without 
emissions of the form (c, e), (e, o~), transducers would not have the capacity to model insertions and 
deletions. Fortunately, as long as the automaton is epsilon free, pointwise multiplications of the 
second kind can be implemented as follows: 
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Proposition 5. If is epsilon free and is any transducer, then the transducer defined 

by: 



a. a' 



'<l,|Mf} if**e 
. ifefP^Jjl otherwise 



satisfies 

w M (d (s, s') ■ w M{ 2) (s) = io M ( P ) (s, s') 

for all s, s' G S* . 

Proof. Note first that since M^ 2 ) is epsilon-free, we have 

7 if s„ = e 



, Mr' otherwise 

qGs n— 1 

for all TV > |s|, s ~~>n s. Moreover, for all matrix Aj, </>(Si Ai) = <j>(Ai), hence: 

/ oo n \ / \ 



i )( S , s ')-- (2) ( S ) = ^(E E E IKkWn^ 

W=0 s~~> N s s'~~* N s n=l / \ctEs 

JV \ / N 



OO / IV \ / f 

e e e 4n<k 4n 

N=0 s-^- N s s'-^jy s \n—l / \n—l ^ 

E E e 4n< p k 

TV— s^ns s'-^-jvs \n— 1 / 

%w(s, s') 



/ if s„ = e 
M?' otherwise 



□ 



Now that we have a simple expression for all of the operations required by the elimination algorithm 
on string- valued graphical models, we turn to the problem of analyzing the time complexity of exact 
inference in SFGs. 



6 Complexity analysis 

In this section, we start by demonstrating the gains in efficiency brought by our framework in the 
general case, that is without assuming any structure on the potentials. We then show that by adding 
one extra assumption on the factors, triangularity, we get further efficiency gains. We conclude the 
section by giving several examples where triangularity holds in practice. 

6.1 General SFGs 

Here we compare the local running-time of the indexed matrix representation with previous approaches 
based on graph algorithms |27j . By local, we mean that we do the analysis for a single operation at the 
time. We present global complexity analyses for some important special cases in the next section. The 
results here are stated in terms of K, the size of an individual transducer, but note that in sequence 
alignment problems, K grows in terms of L because of potentials at the leaves of the tree. This is 
explained in more details in the next section. 
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Normalization: Previous work has relied on generalizations of shortest path algorithms to Conway 
semirings [JJ, which run in time 0(K 3 ). In contrast, since our algorithm is expressed in terms of 
a single matrix inversion, we achieve a running time of 0(K U ), where uj < 2.3727 is the exponent 
of the asymptotic complexity of matrix multiplication [36j. Note that even though the size of 
the original factors' state spaces may be small (for example of the order of the length of the 
observed sequences), the size of the intermediate factors created by the tensor products used in 
the elimination algorithm makes K grow quickly. 

Epsilon- removal: Classical algorithms are also cubic [25]. Since we perform epsilon removal using 
one matrix multiplication followed by one matrix inversion, we achieve the same speedup of 
0[K 3 ~ U ). Note however that this operation does not preserve sparsity in the general case, both 
with the classical and the proposed method. This issue is addressed in the next section. 

Pointwise products: The operation corresponding to pointwise products in the previous transducer 
literature is the intersection operation [51 127|. In this case, the complexity of our algorithm is 
the same as previous work, but our algorithm is easier to implement when one has access to a 
matrix library. Moreover, tensor products preserve sparsity. More precisely, we use the following 
simple results: if A has k nonzero entries, and B, I nonzero entries, then forming the tensor 
product takes time kl. We assume throughout this section that matrices are stored in a sparse 
representation. 

Note that these running times scale linearly in |S| for operations on unary factors, and quadratically 
in |E| for operations on binary factors. 

6.2 Triangular SFGs 

When executing the elimination algorithm on SFGs, the intermediate factors produced by pointwise 
multiplications are fed into the next operations, augmenting the size of the matrices that need to be 
stored. We show in this section that by making one additional assumption, part of this growth can 
be managed using sparsity and properties of tensor products. 

We will cover the following two SFG topologies: the star SFG, and the perfect binary SFG, shown 
in Figure [7j We denote the number of leaves by L, and we let N and c be bounds on the sizes of the 
unary potentials at the leaves and the binary potentials respectively (by a size N potential M, we 
mean that the matrices M a have size N by N). For example, in the phylogenetics setup, N is equal 
to an upper bound on the lengths of the sequences being analyzed, L is the number of taxa under 
study, and c = 2 in the case of a TKF91 model. We start our discussion with the star SFG 

In Figure [7J we show the result of the first two steps of the elimination algorithm]^] The operations 
involved in this first step are pointwise products of the second kind. The intermediate factors produced, 
are each of size cN. The second step is a marginalization, which does not increase 
the size of the matrices. We therefore have: 




Computing the right-hand-side naively would involve inverting an (cN) L by (cN) L dense matrix — 

even if the matrices Mjp are sparse, there is no a priori guarantee for (m^ 1 ^ to be sparse as well. 

This would lead to a slow running time of (cN) Lu . In the rest of this section, we show that this can 
be considerably improved by using properties of tensor products and an extra assumptions on the 
factors: 

6 We omit the factor at the root since its effect on the running time is a constant independent of L and N. 
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(b) 






Figure 7: (a) The binary topology, (b) The star topology with the first two steps of the elimination 
algorithm on that topology. Note that in both cases we ignore the root factor, which only affect the 
running time by a constant independent of L and N. 



Definition 6. A factor, transducer or automaton is called triangular if its states can be ordered in 
such a way that all of its indexed matrices are upper triangular. A SFG is triangular if all of its leaf 
potentials are triangular. 

Note that we do not require that all the factors of SFG be triangular. In the case of TKF91 for 
example, only the factors at the leaves are triangular. This is enough to prove the main proposition 
of this section: 

Proposition 7. If a star-shaped or perfect binary triangular SFG has L unary potentials of size N 
and binary potentials of size c then the normalizer of the SFG can be computed in time (cN) L . 

In order to prove this result, we need the following standard properties [19]: 

Lemma 8. Let A™' be m x k matrices, and be k x n matrices. We have: 



A (L) B (L) 



A^ 



Lemma 9. If A^' are invertible, then: 



A* 



A (i) 



A^ 



Lemma 10. If A is a mx k invertible matrix, and B is a k x n matrix such that A — B is invertible, 
then (A^B)* ^A(A-B)- 1 . 

The next lemma states that in a sense triangular factors are contagious: 

Lemma 11. If a factor is triangular, then marginalizing it or making it epsilon-free creates a trian- 
gular potential. Moreover, the outcome of a pointwise product (either of the first or second kind) is 
guaranteed to be triangular whenever at least one of its input automaton or transducer is triangular. 



Proof. For marginalization and pointwise products, the result follows directly from Proposition [3j |4j 
and [5j For epsilon-removal, it follows from Proposition [T] and the fact that the inverse of a triangular 
matrix is also triangular. □ 

We can now prove Proposition [7] 

Proof. Note first that by Lemma |ll[ the matrices is upper triangular. Since the intermediate 
factors represent probabilities, !o M (o(s) = V T (Xf l — s,^), we have that Me < 1. It follows 

that the eigenvalues of are strictly smaller than one, so Proposition [l] can be applied to M W for 
each i E {!,■■■ , L}. 



13 



Next, by applying Proposition [2] we get: 

Z = JE{(M«)*Mw|...|(Mf))*Jl4 i )} 
This expression can be simplified using Lemma 8][9 and 10 to get: 



(MP)* ... (u^WUt 



AD 



Mi 



Mi 1 ) 



Mi L > M« 



Mi 



^(^(A-i?)- 1 ), 



where M = / — M, A = < M, 



r(l) 



M, 



(i) 



and B = £- eS M^ 



r(l) 



This equation allows us to exploit sparsity patterns. By Lemma 
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Mf) 



the matrices M; 1 ' are 



upper triangular sparse matrices (more precisely: cN by ciV matrices with cN non zero entries), 
so the matrices A and B are upper triangular as well, with only 0((cN) L ) non-zero components. 
Furthermore, if we let C = (A — B) , we can see that only its last column x is actually needed to 
compute Z. At the same time, we can write (^4 — B)x = [0, 0, . . . , 0, 1], which can be solved using the 
back substitution algorithm, getting an overall running time of 0((cN) L ). 

Note that in the above argument, the only assumption we have used on the factors at the leaves 
is triangularity. The same argument hence applies when these factors come from the elimination 
algorithm executed on a binary tree. The running time for the binary tree case can therefore be 
established by an inductive argument on the depth where the inductive step is the same as the 
argument above. □ 



7 Discussion 

By exploiting closed-form matrix expressions rather than specialized automaton and transducer algo- 
rithms, our method gains several advantages. First, our algorithms are easier to understand since they 
can be built on top of well-known matrix algebra results. Second, our algorithms have asymptotic 
running times better or equal to previous transducer algorithms. Third, by building implementations 
on top of existing matrix packages, our method automatically gets access to optimized libraries, mak- 
ing it easier for example to exploit GPU parallelization from off-the-shelf libraries such as gpumatrix 
[25) . Fourth, our method also has potential applications as a proving tool. Our concise closed- form 
expression for exact inference could be used for instance to study properties of various indel processes. 

In phylogenies that are too large to be handled by exact inference algorithms, our results can still 
be used as building blocks for approximate inference algorithms. For example, most existing MCMC 
samplers for TKF91 models can be expressed as normalization problems on a small subtree of the 
original phylogeny. The simplest way to obtain this subtree is to fix the value of the strings at all 
internal nodes except for two adjacent internal nodes. Resampling a new topology for these two nodes 
amounts to computing the normalization of two SFGs over a subtree of four leaves (their ratio appears 
in the Metropolis-Hastings ratio) [28]. Our approach can also serve as the foundation of Sequential 
Monte Carlo (SMC) approximations [33j[l0l[5]. In this case, the ratio of normalizations appears as a 
particle weight update. 
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